# Create a code chunk and set your working directory
setwd("~/Downloads/CapstoneProject-02")
OVERVIEW OF THE BELLABEAT COMPANY
Bellabeat is the go-to wellness brand for women, offering an ecosystem of products and services focused on women’s health. Founded by Urška Sršen and Sando Mur, Bellabeat aims to analyze data to unlock new growth opportunities and inform its marketing strategy.
STAKEHOLDERS
The stakeholders include:
1. Urška Sršen, Bellabeat co-founder and Chief Executive Officer
2. Sando Mur, Mathematician and Bellabeat’s Co-founder
3. Bellabeat’s Marketing Analytics Team
Business Task
Bellabeat aims to analyze the usage data from one of its products to gain insights and make high-level recommendations that will inform its marketing strategy.
Questions for Analysis
Data Source:
The FitBit Fitness Tracker Data dataset by Mobius, released under the CC0: Public Domain license, was used. The dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Around 30 eligible FitBit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.
Sorting & Filtering
My analysis focuses on trends in how the app is used, so it concentrates on user engagement.
library(readr)
library(tidyverse)
library(dplyr)
library(lubridate)
library(tidyr)
DailyActivity_01 <- read_csv("archive/mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/dailyActivity_merged.csv")
## Rows: 457 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
DailyActivity_02 <- read_csv("archive/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
DailySleep <- read_csv("archive/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
HourlySteps_01 <- read_csv("archive/mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/hourlySteps_merged.csv")
## Rows: 24084 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, StepTotal
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
HourlySteps_02 <- read_csv("archive/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv")
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, StepTotal
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
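The column-specification messages above can be silenced, and the column types pinned, by passing `col_types` to `read_csv()`. The following is a minimal sketch, assuming readr 2.0 or later (needed for `I()` literal input). The inline CSV is made-up sample text standing in for the real file, and reading `Id` as character is a choice of this sketch, not something the analysis here does.

```r
library(readr)

# Literal CSV text standing in for dailyActivity_merged.csv (illustrative only)
sample_csv <- I("Id,ActivityDate,TotalSteps\n1503960366,3/25/2016,11004\n1503960366,3/26/2016,17609\n")

# Declaring col_types up front suppresses the column-specification message
# and keeps Id from being treated as a number
sample_activity <- read_csv(
  sample_csv,
  col_types = cols(
    Id = col_character(),
    ActivityDate = col_character(),
    TotalSteps = col_double()
  )
)

str(sample_activity)
```

Keeping `Id` as text avoids output such as `1.504e+09` when the column is summarised, at the cost of changing its type in later steps.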
To confirm that the files loaded successfully, here are the first few rows of one of the dataframes, DailyActivity_01.
head(DailyActivity_01)
## # A tibble: 6 × 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 3/25/2016 11004 7.11 7.11
## 2 1503960366 3/26/2016 17609 11.6 11.6
## 3 1503960366 3/27/2016 12736 8.53 8.53
## 4 1503960366 3/28/2016 13231 8.93 8.93
## 5 1503960366 3/29/2016 12041 7.85 7.85
## 6 1503960366 3/30/2016 10970 7.16 7.16
## # ℹ 10 more variables: LoggedActivitiesDistance <dbl>,
## # VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## # LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## # VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>
Next, let’s take a look at and check the sample size of each dataframe.
# Checking distinct values to find out the sample size
n_distinct(DailyActivity_01$Id)
## [1] 35
n_distinct(DailyActivity_02$Id)
## [1] 33
n_distinct(DailySleep$Id)
## [1] 24
n_distinct(HourlySteps_01$Id)
## [1] 34
n_distinct(HourlySteps_02$Id)
## [1] 33
From this information, most dataframes contain between 33 and 35 participants; however, the DailySleep dataframe contains only 24 users.
DailyActivity_01 and DailyActivity_02 have the same structure and differ only in the time period they cover, so I merge them into a single dataframe named ‘DailyActivity’. Likewise, HourlySteps_01 and HourlySteps_02 differ only in time period, so I merge them into a single dataframe named ‘HourlySteps’.
# Merge DailyActivity_01 and DailyActivity_02 together
DailyActivity <- rbind(DailyActivity_01, DailyActivity_02)
# Merge HourlySteps_01 and HourlySteps_02 together
HourlySteps <- rbind(HourlySteps_01, HourlySteps_02)
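Because `rbind()` stacks rows column by column, it is worth confirming that both exports share the same columns before merging. The following is a small base-R sketch on made-up rows; `period_1` and `period_2` are hypothetical stand-ins for the two export periods.

```r
# Hypothetical stand-ins for the two export periods
period_1 <- data.frame(
  Id = c(1, 2),
  ActivityDate = c("4/10/2016", "4/11/2016"),
  TotalSteps = c(5000, 7000)
)
period_2 <- data.frame(
  Id = c(1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016"),
  TotalSteps = c(6500, 8000)
)

# Stop early if the column names or their order differ
stopifnot(identical(names(period_1), names(period_2)))

combined <- rbind(period_1, period_2)
nrow(combined)  # row count equals the sum of the two inputs
```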
The date columns were read in as character: ‘ActivityDate’ in the DailyActivity dataframe, ‘ActivityHour’ in the HourlySteps dataframe, and ‘SleepDay’ in the DailySleep dataframe. Before proceeding with further analysis, I need to convert them to a date or date-time format.
Before proceeding, I created a duplicate dataframe for every dataframe named beginning with Clean_DataframeName to ensure the original data remains unchanged.
# Create a new data frame for DailyActivity
Clean_DailyActivity <- DailyActivity
# Verify
head(Clean_DailyActivity)
## # A tibble: 6 × 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 3/25/2016 11004 7.11 7.11
## 2 1503960366 3/26/2016 17609 11.6 11.6
## 3 1503960366 3/27/2016 12736 8.53 8.53
## 4 1503960366 3/28/2016 13231 8.93 8.93
## 5 1503960366 3/29/2016 12041 7.85 7.85
## 6 1503960366 3/30/2016 10970 7.16 7.16
## # ℹ 10 more variables: LoggedActivitiesDistance <dbl>,
## # VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## # LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## # VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>
library(lubridate)
# Convert ActivityDate to proper date format
Clean_DailyActivity$ActivityDate <- mdy(Clean_DailyActivity$ActivityDate)
# Check the results
head(Clean_DailyActivity$ActivityDate)
## [1] "2016-03-25" "2016-03-26" "2016-03-27" "2016-03-28" "2016-03-29"
## [6] "2016-03-30"
class(Clean_DailyActivity$ActivityDate)
## [1] "Date"
Now the data type of the ‘ActivityDate’ column in the Clean_DailyActivity data frame is changed to Date.
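`mdy()` returns `NA` for any string it cannot parse, so counting the `NA`s after conversion confirms that every date was read. A sketch on made-up strings:

```r
library(lubridate)

raw_dates <- c("3/25/2016", "4/1/2016", "12/31/2016")
parsed <- mdy(raw_dates)

class(parsed)        # "Date"
sum(is.na(parsed))   # 0 means every string parsed
```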
Next, it’s time to change the data type of the ‘ActivityHour’ column in the HourlySteps dataframe, but first let’s create a new dataframe named ‘Clean_HourlySteps’ so that the original dataframe remains the same.
# Create a new data frame for HourlySteps
Clean_HourlySteps <- HourlySteps
# Verify
head(Clean_HourlySteps)
## # A tibble: 6 × 3
## Id ActivityHour StepTotal
## <dbl> <chr> <dbl>
## 1 1503960366 3/12/2016 12:00:00 AM 0
## 2 1503960366 3/12/2016 1:00:00 AM 0
## 3 1503960366 3/12/2016 2:00:00 AM 0
## 4 1503960366 3/12/2016 3:00:00 AM 0
## 5 1503960366 3/12/2016 4:00:00 AM 0
## 6 1503960366 3/12/2016 5:00:00 AM 0
Now, convert the column to a date-time.
# Convert ActivityHour to proper date format
Clean_HourlySteps <- Clean_HourlySteps %>%
mutate(
ActivityHour = mdy_hms(ActivityHour), # converts "3/12/2016 10:00:00 AM"
hour = hour(ActivityHour)
)
# Check the results
head(Clean_HourlySteps$ActivityHour)
## [1] "2016-03-12 00:00:00 UTC" "2016-03-12 01:00:00 UTC"
## [3] "2016-03-12 02:00:00 UTC" "2016-03-12 03:00:00 UTC"
## [5] "2016-03-12 04:00:00 UTC" "2016-03-12 05:00:00 UTC"
class(Clean_HourlySteps$ActivityHour)
## [1] "POSIXct" "POSIXt"
Now the data type of the ‘ActivityHour’ column in the Clean_HourlySteps data frame is changed to POSIXct.
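`mdy_hms()` also reads the AM/PM marker, which is what lets the extracted `hour` run from 0 to 23. A short check on made-up timestamps:

```r
library(lubridate)

stamps <- mdy_hms(c(
  "3/12/2016 12:00:00 AM",
  "3/12/2016 1:00:00 PM",
  "3/12/2016 11:00:00 PM"
))

hour(stamps)  # midnight maps to 0, 1 PM to 13, 11 PM to 23
```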
Last, I need to change the data type of the ‘SleepDay’ column from character to date-time. As with the other two dataframes, I first create a new dataframe so that the original keeps its values.
# Create a new data frame for DailySleep
Clean_DailySleep <- DailySleep
# Verify
head(Clean_DailySleep)
## # A tibble: 6 × 5
## Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:0… 1 327 346
## 2 1503960366 4/13/2016 12:0… 2 384 407
## 3 1503960366 4/15/2016 12:0… 1 412 442
## 4 1503960366 4/16/2016 12:0… 2 340 367
## 5 1503960366 4/17/2016 12:0… 1 700 712
## 6 1503960366 4/19/2016 12:0… 1 304 320
Now, convert the column to a date-time.
# Convert SleepDay to proper date format
Clean_DailySleep <- Clean_DailySleep %>%
mutate(
SleepDay = mdy_hms(SleepDay), # converts "3/12/2016 10:00:00 AM"
hour = hour(SleepDay)
)
# Check the results
head(Clean_DailySleep$SleepDay)
## [1] "2016-04-12 UTC" "2016-04-13 UTC" "2016-04-15 UTC" "2016-04-16 UTC"
## [5] "2016-04-17 UTC" "2016-04-19 UTC"
class(Clean_DailySleep$SleepDay)
## [1] "POSIXct" "POSIXt"
Now the data type of the ‘SleepDay’ column in the Clean_DailySleep data frame is changed to POSIXct.
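Every `SleepDay` timestamp falls at midnight, so the time part carries no information. If the sleep data is later matched to the daily activity data by date, one option is to drop the time with `as_date()`. This is a sketch under that assumption; the analysis here keeps the POSIXct column.

```r
library(lubridate)

sleep_stamp <- mdy_hms(c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM"))
sleep_date <- as_date(sleep_stamp)

class(sleep_date)  # "Date", the same class as ActivityDate after mdy()
```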
After that, I need to find out whether any user has more than one row for the same date (or hour) in these 3 dataframes (Clean_DailyActivity, Clean_DailySleep, and Clean_HourlySteps).
#Checking any duplicate based on user + date on DailyActivity dataframe
sum(duplicated(Clean_DailyActivity[, c("Id", "ActivityDate")]))
## [1] 24
It turns out that the Clean_DailyActivity dataframe has 24 duplicate user-date rows with differing values. For each duplicated user-date pair, I keep the row with the highest TotalSteps and drop the others.
# Keep the row with max steps or values
Clean_DailyActivity <- Clean_DailyActivity %>%
group_by(Id, ActivityDate) %>%
slice_max(TotalSteps, n = 1, with_ties = FALSE) %>%
ungroup()
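The effect of `slice_max()` can be seen on a toy example. The rows below are made up, and the idea that the lower-step row is a partial record is an assumption of this sketch, not something the data confirms.

```r
library(dplyr)

# Made-up rows: user 1 has two conflicting records for the same date
toy_activity <- tibble(
  Id = c(1, 1, 2),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-12", "2016-04-12")),
  TotalSteps = c(224, 8539, 4000)
)

deduped <- toy_activity %>%
  group_by(Id, ActivityDate) %>%
  slice_max(TotalSteps, n = 1, with_ties = FALSE) %>%  # keep the fuller record
  ungroup()

deduped
```

Setting `with_ties = FALSE` guarantees exactly one row per group even when two records share the same step count.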
#Checking any duplicate based on user + date on DailySleep dataframe
sum(duplicated(Clean_DailySleep[, c("Id", "SleepDay")]))
## [1] 3
Checking the Clean_DailySleep dataframe shows 3 duplicate rows. Unlike in Clean_DailyActivity, these duplicates have exactly the same values, so I keep only the first occurrence and remove the rest.
# Keep only the first occurrence, and remove the rest
Clean_DailySleep <- Clean_DailySleep %>%
group_by(Id, SleepDay) %>%
slice(1) %>%
ungroup()
# Checking any duplicate based on user + date on Clean_HourlySteps dataframe
sum(duplicated(Clean_HourlySteps[, c("Id", "ActivityHour")]))
## [1] 175
There are 175 duplicate rows in the Clean_HourlySteps dataframe, again with identical values. The next step is to remove the duplicates and keep only the first occurrence.
# Keep only the first occurrence, and remove the rest
Clean_HourlySteps <- Clean_HourlySteps %>%
group_by(Id, ActivityHour) %>%
slice(1) %>%
ungroup()
Now none of the dataframes contains duplicate rows.
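For exact duplicates, `distinct()` with `.keep_all = TRUE` gives the same result as grouping and taking `slice(1)`, in a single step. This sketch on made-up rows is offered as an alternative, not as the method used above.

```r
library(dplyr)

toy_sleep <- tibble(
  Id = c(1, 1, 2),
  SleepDay = as.Date(c("2016-04-12", "2016-04-12", "2016-04-12")),
  TotalMinutesAsleep = c(327, 327, 410)
)

# Keep the first row for each Id + SleepDay pair, retaining all other columns
toy_sleep_unique <- toy_sleep %>%
  distinct(Id, SleepDay, .keep_all = TRUE)

sum(duplicated(toy_sleep_unique[, c("Id", "SleepDay")]))  # 0 after de-duplication
```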
Next is to figure out if there are missing values in each data frame.
# 1. DailyActivity
colSums(is.na(Clean_DailyActivity))
## Id ActivityDate TotalSteps
## 0 0 0
## TotalDistance TrackerDistance LoggedActivitiesDistance
## 0 0 0
## VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
## 0 0 0
## SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes
## 0 0 0
## LightlyActiveMinutes SedentaryMinutes Calories
## 0 0 0
colMeans(is.na(Clean_DailyActivity)) * 100
## Id ActivityDate TotalSteps
## 0 0 0
## TotalDistance TrackerDistance LoggedActivitiesDistance
## 0 0 0
## VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
## 0 0 0
## SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes
## 0 0 0
## LightlyActiveMinutes SedentaryMinutes Calories
## 0 0 0
# 2. DailySleep
colSums(is.na(Clean_DailySleep))
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## 0 0 0 0
## TotalTimeInBed hour
## 0 0
colMeans(is.na(Clean_DailySleep)) * 100
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## 0 0 0 0
## TotalTimeInBed hour
## 0 0
# 3. HourlySteps
colSums(is.na(Clean_HourlySteps))
## Id ActivityHour StepTotal hour
## 0 0 0 0
colMeans(is.na(Clean_HourlySteps)) * 100
## Id ActivityHour StepTotal hour
## 0 0 0 0
Based on this output, there are no missing values in any dataframe, and the data are now clean and ready for analysis.
Let’s have a look at the summary statistics of the DailyActivity dataset to find out the overall activity patterns.
# Core Activity Summary
Clean_DailyActivity %>%
select(TotalSteps,
TotalDistance,
SedentaryMinutes,
LightlyActiveMinutes,
FairlyActiveMinutes,
VeryActiveMinutes,
Calories) %>%
summary()
## TotalSteps TotalDistance SedentaryMinutes LightlyActiveMinutes
## Min. : 0 Min. : 0.000 Min. : 0 Min. : 0.0
## 1st Qu.: 3321 1st Qu.: 2.280 1st Qu.: 734 1st Qu.:117.0
## Median : 7142 Median : 5.030 Median :1062 Median :196.0
## Mean : 7377 Mean : 5.289 Mean :1001 Mean :188.1
## 3rd Qu.:10645 3rd Qu.: 7.570 3rd Qu.:1246 3rd Qu.:263.0
## Max. :36019 Max. :28.030 Max. :1440 Max. :720.0
## FairlyActiveMinutes VeryActiveMinutes Calories
## Min. : 0.0 Min. : 0.00 Min. : 0
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.:1820
## Median : 6.0 Median : 2.00 Median :2129
## Mean : 13.6 Mean : 19.87 Mean :2295
## 3rd Qu.: 18.0 3rd Qu.: 30.00 3rd Qu.:2781
## Max. :660.0 Max. :210.00 Max. :4900
After analyzing 1,373 daily activity records from 35 users (the 1,397 merged rows minus the 24 duplicates removed), several key patterns emerged:
1. Overall Activity Levels - Users averaged 7,377 steps per day, which is below the commonly recommended 10,000 steps. The median of 7,142 steps means that on half of all recorded days users took fewer steps than that, suggesting significant room for improvement in daily activity levels.
2. Sedentary Lifestyle Concerns - Perhaps most concerning, users were sedentary for an average of 16.7 hours per day. This high sedentary time is consistent with desk-based work environments but presents health risks and a key opportunity for intervention.
3. Exercise Intensity - While users averaged 188 minutes of light activity daily, they logged only about 33.5 minutes of moderate-to-vigorous physical activity (fairly + very active minutes combined). This mean is above the WHO minimum of 150 minutes of moderate activity per week (about 21 minutes per day), but the medians are far lower (6 fairly active and 2 very active minutes), so on a typical day most users could benefit from more intense exercise.
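The figures in this list can be re-derived from the summary table as a check. The 150 minutes per week is the WHO minimum for moderate activity, spread evenly over seven days here for comparison.

```r
# Means taken from the summary() output above
mean_sedentary_min <- 1001
mean_fairly_active <- 13.6
mean_very_active <- 19.87

sedentary_hours <- mean_sedentary_min / 60               # about 16.7 hours
mvpa_minutes <- mean_fairly_active + mean_very_active    # about 33.5 minutes
who_daily_minimum <- 150 / 7                             # about 21.4 minutes

round(c(sedentary_hours, mvpa_minutes, who_daily_minimum), 1)
```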
Let’s make some visualizations based on this
# Histogram of daily steps
ggplot(Clean_DailyActivity, aes(x = TotalSteps)) +
geom_histogram(binwidth = 1000, fill = "#2E86AB", color = "white") +
# linewidth replaces the size aesthetic for lines (deprecated in ggplot2 3.4.0)
geom_vline(aes(xintercept = mean(TotalSteps)),
color = "red", linetype = "dashed", linewidth = 1) +
labs(title = "Distribution of Daily Steps",
subtitle = paste("Average:", round(mean(Clean_DailyActivity$TotalSteps)), "steps"),
x = "Total Steps",
y = "Frequency") +
theme_minimal()
The distribution reveals significant variability in daily activity. A notable spike at zero steps indicates days when users either didn’t wear their devices or were completely inactive. The majority of activity days cluster between 5,000-10,000 steps, with a small group of highly active days exceeding 20,000 steps. This suggests Bellabeat should focus on both consistency (reducing zero-step days) and motivation (helping users reach 10,000 steps more regularly).
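One way to quantify the zero-step spike is to count those days and, if they are treated as non-wear days, exclude them before averaging. This is a sketch on made-up values; treating zero-step days as non-wear is an assumption.

```r
toy_steps <- c(0, 0, 4200, 7800, 10500, 12000)

zero_days <- sum(toy_steps == 0)
share_zero <- zero_days / length(toy_steps)

mean_all <- mean(toy_steps)                    # average including zero-step days
mean_worn <- mean(toy_steps[toy_steps > 0])    # average on days with recorded steps

c(zero_days = zero_days, mean_all = mean_all, mean_worn = mean_worn)
```

The gap between the two means shows how much non-wear days can pull the headline average down.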
# Histogram of calories
ggplot(Clean_DailyActivity, aes(x = Calories)) +
geom_histogram(binwidth = 200, fill = "#A23B72", color = "white") +
geom_vline(aes(xintercept = mean(Calories)),
color = "red", linetype = "dashed", linewidth = 1) +
labs(title = "Distribution of Daily Calories Burned",
subtitle = paste("Average:", round(mean(Clean_DailyActivity$Calories)), "calories"),
x = "Calories",
y = "Frequency") +
theme_minimal()
Calorie expenditure follows a more nearly normal distribution centered around 2,000-2,500 calories per day, which aligns with typical adult basal metabolic rates plus light activity. The distribution suggests most users maintain relatively consistent daily energy expenditure, even when step counts vary significantly. This indicates that factors beyond steps (such as body composition and non-tracked activities) play important roles in total calorie burn.
# Calculate total active vs sedentary time
ActivityBreakdown <- Clean_DailyActivity %>%
summarise(
Sedentary = mean(SedentaryMinutes),
`Lightly Active` = mean(LightlyActiveMinutes),
`Fairly Active` = mean(FairlyActiveMinutes),
`Very Active` = mean(VeryActiveMinutes)
) %>%
pivot_longer(everything(), names_to = "Activity_Type", values_to = "Minutes")
# bar chart
ggplot(ActivityBreakdown, aes(x = reorder(Activity_Type, -Minutes), y = Minutes, fill = Activity_Type)) +
geom_col() +
geom_text(aes(label = round(Minutes, 0)), vjust = -0.5) +
scale_fill_manual(values = c("Sedentary" = "#E63946",
"Lightly Active" = "#F4A261",
"Fairly Active" = "#2A9D8F",
"Very Active" = "#264653")) +
labs(title = "Average Daily Activity Minutes by Intensity",
x = "Activity Level",
y = "Minutes") +
theme_minimal() +
theme(legend.position = "none")
This visualization reveals the most concerning finding of the analysis: users spend an overwhelming 1,001 minutes (16.7 hours) per day sedentary, compared to roughly 33.5 minutes of moderate-to-vigorous physical activity. This roughly 30:1 ratio of sitting to meaningful exercise represents a significant health risk and a major opportunity for Bellabeat intervention. Users do achieve 188 minutes of light activity (likely walking and household tasks), but higher-intensity exercise is scarce: the median day includes only 6 fairly active and 2 very active minutes.
# Box plot for steps (shows outliers and distribution)
ggplot(Clean_DailyActivity, aes(y = TotalSteps)) +
geom_boxplot(fill = "#2E86AB", alpha = 0.7) +
labs(title = "Daily Steps Distribution with Outliers",
y = "Total Steps") +
theme_minimal() +
theme(axis.text.x = element_blank())
The box plot confirms the wide variability in user behavior, with 50% of days falling between 3,300-10,600 steps. Numerous outliers above 20,000 steps suggest users are capable of high activity but don’t sustain it consistently. This indicates Bellabeat should focus on helping users maintain moderate consistency (7,000-10,000 steps daily) rather than encouraging occasional extreme activity days.
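By default, `geom_boxplot()` draws as outliers the points lying more than 1.5 times the interquartile range beyond the box. Using the quartiles from the summary output above, the cut-offs work out as follows:

```r
# Quartiles of TotalSteps from the summary() output
q1 <- 3321
q3 <- 10645

iqr <- q3 - q1                   # 7324
upper_fence <- q3 + 1.5 * iqr    # 21631
lower_fence <- q1 - 1.5 * iqr    # negative, so no low outliers are possible

c(iqr = iqr, upper_fence = upper_fence, lower_fence = lower_fence)
```

Only days above roughly 21,600 steps are plotted as individual outlier points.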
user_summary <- Clean_DailyActivity %>%
group_by(Id) %>%
summarise(
avg_steps = mean(TotalSteps),
avg_calories = mean(Calories),
avg_active_minutes = mean(VeryActiveMinutes + FairlyActiveMinutes),
avg_sedentary_minutes = mean(SedentaryMinutes),
tracking_days = n()
) %>%
ungroup()
# View the data
head(user_summary)
## # A tibble: 6 × 6
## Id avg_steps avg_calories avg_active_minutes avg_sedentary_minutes
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 12175. 1845. 56.7 850.
## 2 1624580081 5137. 1449. 9.67 1279.
## 3 1644430081 7781. 2838. 37.8 1130.
## 4 1844505072 2944. 1614. 1.48 1176.
## 5 1927972279 1299. 2225. 2.02 1241.
## 6 2022484408 11711. 2533. 57.9 1111.
## # ℹ 1 more variable: tracking_days <int>
summary(user_summary)
## Id avg_steps avg_calories avg_active_minutes
## Min. :1.504e+09 Min. : 773.6 Min. :1449 Min. : 0.2619
## 1st Qu.:2.610e+09 1st Qu.: 4472.0 1st Qu.:1892 1st Qu.: 8.6855
## Median :4.445e+09 Median : 7363.0 Median :2192 Median : 26.2667
## Mean :4.845e+09 Mean : 7078.9 Mean :2275 Mean : 34.0384
## 3rd Qu.:6.869e+09 3rd Qu.: 8671.9 3rd Qu.:2637 3rd Qu.: 54.0292
## Max. :8.878e+09 Max. :16759.4 Max. :3488 Max. :115.2439
## avg_sedentary_minutes tracking_days
## Min. : 656.2 Min. : 8.00
## 1st Qu.: 781.1 1st Qu.:38.50
## Median :1099.9 Median :42.00
## Mean :1012.4 Mean :39.23
## 3rd Qu.:1197.0 3rd Qu.:42.00
## Max. :1369.3 Max. :62.00
user_summary <- user_summary %>%
mutate(
user_type = case_when(
avg_steps < 5000 ~ "Sedentary",
avg_steps < 7500 ~ "Lightly Active",
avg_steps < 10000 ~ "Fairly Active",
TRUE ~ "Very Active"
)
)
# Show each category with the percentages
user_summary %>%
count(user_type) %>%
mutate(percentage = n / sum(n) * 100)
## # A tibble: 4 × 3
## user_type n percentage
## <chr> <int> <dbl>
## 1 Fairly Active 8 22.9
## 2 Lightly Active 9 25.7
## 3 Sedentary 11 31.4
## 4 Very Active 7 20
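`count()` lists the categories alphabetically because `user_type` is a character column. Converting it to a factor with explicit levels keeps the output in activity order. This sketch uses made-up averages and the same thresholds; `activity_levels` is a name introduced here for illustration.

```r
library(dplyr)

toy_users <- tibble(avg_steps = c(2944, 5137, 7781, 12175))

activity_levels <- c("Sedentary", "Lightly Active", "Fairly Active", "Very Active")

toy_users <- toy_users %>%
  mutate(
    user_type = case_when(
      avg_steps < 5000 ~ "Sedentary",
      avg_steps < 7500 ~ "Lightly Active",
      avg_steps < 10000 ~ "Fairly Active",
      TRUE ~ "Very Active"
    ),
    user_type = factor(user_type, levels = activity_levels)  # fixes the sort order
  )

toy_users %>% count(user_type)
```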
Now let’s create a visualization based on this
user_type_summary <- user_summary %>%
count(user_type) %>%
mutate(percentage = n / sum(n) * 100)
ggplot(user_type_summary, aes(x = reorder(user_type, -n), y = n, fill = user_type)) +
geom_col() +
geom_text(aes(label = paste0(n, " (", round(percentage, 1), "%)")),
vjust = -0.5, size = 4) +
scale_fill_manual(values = c("Sedentary" = "#E63946",
"Lightly Active" = "#F4A261",
"Fairly Active" = "#2A9D8F",
"Very Active" = "#264653")) +
labs(title = "User Activity Level Distribution",
subtitle = "Most users (57%) are sedentary or lightly active",
x = "Activity Level",
y = "Number of Users") +
theme_minimal() +
theme(legend.position = "none") +
ylim(0, 13)
Interpretation:
1. The majority need significant support (57%): The largest segment consists of sedentary (31.4%) and lightly active (25.7%) users, together representing 57% of the user base. These 20 users are not meeting the commonly cited goal of 10,000 daily steps and represent Bellabeat’s target audience for intervention.
2. Even distribution across activity levels: Unlike a typical population where most people would cluster in one category, our users show a relatively even distribution across all four levels (ranging from 20-31%). This suggests diverse fitness levels and motivations within the user base.
3. Only 1 in 5 users meets recommended goals: Just 7 users (20%) achieve the “very active” classification of 10,000+ steps daily. This reveals a significant opportunity: 80% of users need help reaching optimal activity levels.
4. The “Almost There” group: The 8 “Fairly Active” users (22.9%) averaging 7,500-10,000 steps represent a key opportunity: they are close to meeting goals and may respond well to targeted motivation to push them over the 10,000-step threshold.
# Scatter plot: Steps vs. Calories
ggplot(Clean_DailyActivity, aes(x = TotalSteps, y = Calories)) +
geom_point(alpha = 0.5, color = "#2E86AB") +
geom_smooth(method = "lm", color = "#E63946", se = TRUE) +
labs(title = "Relationship Between Daily Steps and Calories Burned",
subtitle = paste("Correlation:",
round(cor(Clean_DailyActivity$TotalSteps, Clean_DailyActivity$Calories), 3)),
x = "Total Steps",
y = "Calories Burned") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Calculate correlation
cor(Clean_DailyActivity$TotalSteps, Clean_DailyActivity$Calories)
## [1] 0.5800765
This visualization shows a moderate-to-strong positive correlation between daily steps and calories burned, with a correlation coefficient of 0.58. Steps correlate solidly with calories, but less strongly than distance or very active minutes, as the next plots show.
# Create total active minutes variable
Clean_DailyActivity <- Clean_DailyActivity %>%
mutate(TotalActiveMinutes = VeryActiveMinutes + FairlyActiveMinutes + LightlyActiveMinutes)
# Scatter plot: Active Minutes vs. Calories
ggplot(Clean_DailyActivity, aes(x = TotalActiveMinutes, y = Calories)) +
geom_point(alpha = 0.5, color = "#2A9D8F") +
geom_smooth(method = "lm", color = "#E63946", se = TRUE) +
labs(title = "Relationship Between Active Minutes and Calories Burned",
subtitle = paste("Correlation:",
round(cor(Clean_DailyActivity$TotalActiveMinutes, Clean_DailyActivity$Calories), 3)),
x = "Total Active Minutes",
y = "Calories Burned") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Calculate correlation
cor(Clean_DailyActivity$TotalActiveMinutes, Clean_DailyActivity$Calories)
## [1] 0.4689218
Next, this scatter plot shows a moderate correlation between active minutes and calories burned, with a correlation coefficient of 0.469. Surprisingly, total active minutes (lightly, fairly, and very active combined) shows the weakest correlation of the four measures examined.
# Scatter plot: Very Active Minutes vs. Calories
ggplot(Clean_DailyActivity, aes(x = VeryActiveMinutes, y = Calories)) +
geom_point(alpha = 0.5, color = "#F4A261") +
geom_smooth(method = "lm", color = "#E63946", se = TRUE) +
labs(title = "Impact of High-Intensity Activity on Calories Burned",
subtitle = paste("Correlation:",
round(cor(Clean_DailyActivity$VeryActiveMinutes, Clean_DailyActivity$Calories), 3)),
x = "Very Active Minutes",
y = "Calories Burned") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Calculate correlation
cor(Clean_DailyActivity$VeryActiveMinutes, Clean_DailyActivity$Calories)
## [1] 0.593829
Moving on to high-intensity activity vs. calories burned: this scatter plot shows a moderate-to-strong correlation between very active minutes and calories burned, with a correlation coefficient of 0.594. Very active minutes show the second-strongest correlation with calories, despite users averaging only 20 minutes daily at this intensity.
This reveals an important insight: quality matters as much as quantity.
# Scatter plot: Distance vs. Calories
ggplot(Clean_DailyActivity, aes(x = TotalDistance, y = Calories)) +
geom_point(alpha = 0.5, color = "#264653") +
geom_smooth(method = "lm", color = "#E63946", se = TRUE) +
labs(title = "Relationship Between Distance Traveled and Calories Burned",
subtitle = paste("Correlation:",
round(cor(Clean_DailyActivity$TotalDistance, Clean_DailyActivity$Calories), 3)),
x = "Total Distance (km)",
y = "Calories Burned") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Calculate correlation
cor(Clean_DailyActivity$TotalDistance, Clean_DailyActivity$Calories)
## [1] 0.6295828
Last, distance and calories burned have a correlation coefficient of 0.63, the highest of the four measures. This makes intuitive sense: covering more ground requires more energy expenditure regardless of pace.
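Squaring each coefficient gives the share of variance in calories that a straight-line fit on that variable would account for, which makes the four results easier to compare. The values are the ones printed above.

```r
# Correlation of each measure with Calories, as printed above
correlations <- c(
  TotalSteps = 0.5800765,
  TotalActiveMinutes = 0.4689218,
  VeryActiveMinutes = 0.593829,
  TotalDistance = 0.6295828
)

# r squared, sorted from strongest to weakest
r_squared <- round(sort(correlations^2, decreasing = TRUE), 2)
r_squared
```

Even the strongest predictor here accounts for only about 40% of the variance, so well over half remains unexplained by any single measure.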
# Add day of week column
Clean_DailyActivity <- Clean_DailyActivity %>%
mutate(
day_of_week = wday(ActivityDate, label = TRUE), # Mon, Tue, Wed...
day_type = ifelse(day_of_week %in% c("Sat", "Sun"), "Weekend", "Weekday")
)
# Check it worked
head(Clean_DailyActivity %>% select(ActivityDate, day_of_week, day_type))
## # A tibble: 6 × 3
## ActivityDate day_of_week day_type
## <date> <ord> <chr>
## 1 2016-03-25 Fri Weekday
## 2 2016-03-26 Sat Weekend
## 3 2016-03-27 Sun Weekend
## 4 2016-03-28 Mon Weekday
## 5 2016-03-29 Tue Weekday
## 6 2016-03-30 Wed Weekday
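The labels returned by `wday(label = TRUE)` depend on the session locale, so matching on `"Sat"` and `"Sun"` assumes an English locale. A locale-independent sketch uses the numeric day with Monday as day 1:

```r
library(lubridate)

dates <- as.Date(c("2016-03-25", "2016-03-26", "2016-03-27", "2016-03-28"))

# With week_start = 1, Monday is 1 and Saturday/Sunday are 6 and 7
day_number <- wday(dates, week_start = 1)
day_type <- ifelse(day_number >= 6, "Weekend", "Weekday")

day_type  # Friday, Saturday, Sunday, Monday
```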
# Summary stats by day type
Clean_DailyActivity %>%
group_by(day_type) %>%
summarise(
avg_steps = mean(TotalSteps),
avg_calories = mean(Calories),
avg_distance = mean(TotalDistance),
avg_active_minutes = mean(VeryActiveMinutes + FairlyActiveMinutes),
avg_sedentary = mean(SedentaryMinutes)
)
## # A tibble: 2 × 6
## day_type avg_steps avg_calories avg_distance avg_active_minutes avg_sedentary
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Weekday 7453. 2302. 5.34 33.7 1008.
## 2 Weekend 7188. 2277. 5.16 32.9 984.
These results show no meaningful difference between weekdays and weekends: the gap is only about 265 steps per day (7,453 vs. 7,188). No formal significance test was run, so this is a descriptive comparison.
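Whether a gap of this size could be due to chance can be checked with a two-sample test. The sketch below runs Welch's t-test on simulated data. The numbers are random draws, not the study data, and the test treats days as independent, which repeated measurements from the same users do not strictly satisfy.

```r
set.seed(42)

# Simulated daily step counts (illustrative only)
toy_days <- data.frame(
  day_type = rep(c("Weekday", "Weekend"), times = c(100, 40)),
  TotalSteps = c(
    rnorm(100, mean = 7450, sd = 5000),
    rnorm(40, mean = 7190, sd = 5000)
  )
)

# Welch two-sample t-test of mean steps by day type
result <- t.test(TotalSteps ~ day_type, data = toy_days)
result$p.value
```

A large p-value would be consistent with the descriptive reading that weekday and weekend activity do not differ much.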
# Calculate average by day of week
daily_summary <- Clean_DailyActivity %>%
group_by(day_of_week) %>%
summarise(
avg_steps = mean(TotalSteps),
avg_calories = mean(Calories),
count = n()
)
# Line chart showing weekly pattern
ggplot(daily_summary, aes(x = day_of_week, y = avg_steps, group = 1)) +
geom_line(color = "#2E86AB", linewidth = 1.2) +
geom_point(color = "#2E86AB", size = 3) +
labs(title = "Average Daily Steps by Day of Week",
subtitle = "Do users maintain consistency throughout the week?",
x = "Day of Week",
y = "Average Steps") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 0))
# Box plot comparing weekday vs weekend
ggplot(Clean_DailyActivity, aes(x = day_type, y = TotalSteps, fill = day_type)) +
geom_boxplot() +
scale_fill_manual(values = c("Weekday" = "#2E86AB", "Weekend" = "#E63946")) +
labs(title = "Activity Levels: Weekdays vs Weekends",
x = "Day Type",
y = "Total Steps") +
theme_minimal() +
theme(legend.position = "none")
# Bar chart with multiple metrics
Clean_DailyActivity %>%
group_by(day_type) %>%
summarise(
Steps = mean(TotalSteps),
Calories = mean(Calories),
`Active Minutes` = mean(VeryActiveMinutes + FairlyActiveMinutes)
) %>%
pivot_longer(cols = -day_type, names_to = "Metric", values_to = "Value") %>%
ggplot(aes(x = day_type, y = Value, fill = day_type)) +
geom_col() +
facet_wrap(~Metric, scales = "free_y") +
scale_fill_manual(values = c("Weekday" = "#2E86AB", "Weekend" = "#E63946")) +
labs(title = "Weekday vs Weekend Activity Comparison",
x = "",
y = "Average Value") +
theme_minimal() +
theme(legend.position = "none")
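# Step 1: Convert ActivityHour to datetime and extract hour
# Note: this uses the merged HourlySteps frame, which still contains the 175 duplicate rows;
# Clean_HourlySteps already holds the de-duplicated data with a parsed ActivityHour and hour column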
HourlySteps <- HourlySteps %>%
mutate(
ActivityHour = mdy_hms(ActivityHour), # Convert to datetime
hour = hour(ActivityHour) # Extract just the hour (0-23)
)
# Step 2: NOW create hourly pattern grouped by hour only
hourly_pattern <- HourlySteps %>%
group_by(hour) %>% # Group by hour (0-23), not full datetime
summarise(
avg_steps = mean(StepTotal),
median_steps = median(StepTotal),
total_records = n()
) %>%
arrange(hour)
# View the results
head(hourly_pattern)
## # A tibble: 6 × 4
## hour avg_steps median_steps total_records
## <int> <dbl> <dbl> <int>
## 1 0 43.4 0 1955
## 2 1 21.7 0 1954
## 3 2 13.7 0 1954
## 4 3 6.89 0 1952
## 5 4 11.2 0 1950
## 6 5 34.6 0 1949
ggplot(hourly_pattern, aes(x = hour, y = avg_steps)) +
geom_line(color = "#2E86AB", linewidth = 1.2) +
geom_point(color = "#2E86AB", size = 3) +
geom_area(alpha = 0.3, fill = "#2E86AB") +
scale_x_continuous(breaks = seq(0, 23, 2),
labels = c("12AM", "2AM", "4AM", "6AM", "8AM", "10AM",
"12PM", "2PM", "4PM", "6PM", "8PM", "10PM")) +
labs(title = "Average Hourly Step Count Throughout the Day",
subtitle = "Peak activity occurs between 5-7 PM (evening)",
x = "Hour of Day",
y = "Average Steps per Hour") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Interpretation:
The hourly activity pattern reveals distinct behavioral trends throughout the day:
1. Sleep Period (12 AM - 5 AM): Minimal activity (~20-50 steps/hour) - Users are clearly asleep during these hours - Baseline activity from restless movements
2. Morning Ramp-Up (6 AM - 8 AM): Sharp increase (150 → 400 steps/hour) - Morning routines and commutes begin - Activity doubles within 2 hours
3. Mid-Morning Plateau (9 AM - 11 AM): Sustained moderate activity (~450-500 steps/hour) - Likely reflects desk work with occasional movement - Office workers taking short walks
4. Lunch Peak (12 PM - 1 PM): First daily peak (~550 steps/hour) - Lunch break activity - Walking to restaurants or outdoor breaks
5. Afternoon Dip (3 PM): Notable decrease (~400 steps/hour) - Post-lunch slump - “Dead zone” for activity
6. Evening Peak (5 PM - 7 PM): Highest activity of the day (600+ steps/hour) - 6 PM shows maximum activity - Post-work exercise, evening walks - Commute home + intentional fitness
7. Evening Wind-Down (8 PM - 11 PM): Gradual decline (380 → 200 steps/hour) - Dinner, relaxation, home activities - Preparing for sleep
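The `hour` column and the evening peak described above could be reproduced with a short sketch like the one below. It uses base R so it runs without extra packages. The column name `ActivityHour`, its timestamp format, and every step value in `toy` are assumptions made up for illustration, not values from the dataset.

```r
# Toy rows (made-up values); the timestamp format is an assumption
toy <- data.frame(
  ActivityHour = c("4/12/2016 12:00:00 AM", "4/12/2016 6:00:00 PM",
                   "4/13/2016 12:00:00 AM", "4/13/2016 6:00:00 PM",
                   "4/13/2016 1:00:00 PM"),
  StepTotal = c(10, 700, 30, 500, 400)
)

# Parse the 12-hour clock strings; %p (AM/PM) parsing depends on the locale
ts <- as.POSIXct(toy$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Keep only the hour of day (0-23), dropping the date part
toy$hour <- as.integer(format(ts, "%H"))

# Average steps per hour of day, then pick the hour with the highest average
hourly <- aggregate(StepTotal ~ hour, data = toy, FUN = mean)
peak_hour <- hourly$hour[which.max(hourly$StepTotal)]
peak_hour
```

Applied to the real `hourly_pattern` table, the same `which.max()` step would confirm which hour carries the highest average instead of reading it off the chart.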
# Overall sleep summary
Clean_DailySleep %>%
mutate(sleep_hours = TotalMinutesAsleep / 60) %>%
select(TotalMinutesAsleep, sleep_hours, TotalTimeInBed) %>%
summary()
## TotalMinutesAsleep sleep_hours TotalTimeInBed
## Min. : 58.0 Min. : 0.9667 Min. : 61.0
## 1st Qu.:361.0 1st Qu.: 6.0167 1st Qu.:403.8
## Median :432.5 Median : 7.2083 Median :463.0
## Mean :419.2 Mean : 6.9862 Mean :458.5
## 3rd Qu.:490.0 3rd Qu.: 8.1667 3rd Qu.:526.0
## Max. :796.0 Max. :13.2667 Max. :961.0
# More detailed summary
Clean_DailySleep %>%
mutate(
sleep_hours = TotalMinutesAsleep / 60,
time_in_bed_hours = TotalTimeInBed / 60,
sleep_efficiency = (TotalMinutesAsleep / TotalTimeInBed) * 100
) %>%
summarise(
avg_sleep_hours = mean(sleep_hours),
median_sleep_hours = median(sleep_hours),
min_sleep_hours = min(sleep_hours),
max_sleep_hours = max(sleep_hours),
avg_time_in_bed = mean(time_in_bed_hours),
avg_sleep_efficiency = mean(sleep_efficiency),
total_sleep_records = n(),
unique_users = n_distinct(Id)
)
## # A tibble: 1 × 8
## avg_sleep_hours median_sleep_hours min_sleep_hours max_sleep_hours
## <dbl> <dbl> <dbl> <dbl>
## 1 6.99 7.21 0.967 13.3
## # ℹ 4 more variables: avg_time_in_bed <dbl>, avg_sleep_efficiency <dbl>,
## # total_sleep_records <int>, unique_users <int>
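Average Sleep: 6.99 hours per night, just below the recommended minimum of 7 hours, although the median night (7.21 hours) clears it. Users are close to getting enough sleep, but a few very short nights pull the average down.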
# Calculate average sleep per user
user_sleep_summary <- Clean_DailySleep %>%
mutate(
sleep_hours = TotalMinutesAsleep / 60,
sleep_efficiency = (TotalMinutesAsleep / TotalTimeInBed) * 100
) %>%
group_by(Id) %>%
summarise(
avg_sleep_hours = mean(sleep_hours),
avg_time_in_bed = mean(TotalTimeInBed) / 60,
avg_sleep_efficiency = mean(sleep_efficiency),
sleep_records = n()
) %>%
ungroup()
# View the results
head(user_sleep_summary)
## # A tibble: 6 × 5
## Id avg_sleep_hours avg_time_in_bed avg_sleep_efficiency sleep_records
## <dbl> <dbl> <dbl> <dbl> <int>
## 1 1503960366 6.00 6.39 93.6 25
## 2 1644430081 4.9 5.77 88.2 4
## 3 1844505072 10.9 16.0 67.8 3
## 4 1927972279 6.95 7.30 94.7 5
## 5 2026352035 8.44 8.96 94.1 28
## 6 2320127002 1.02 1.15 88.4 1
summary(user_sleep_summary)
## Id avg_sleep_hours avg_time_in_bed avg_sleep_efficiency
## Min. :1.504e+09 Min. : 1.017 Min. : 1.150 Min. :63.37
## 1st Qu.:2.340e+09 1st Qu.: 5.605 1st Qu.: 6.284 1st Qu.:91.40
## Median :4.502e+09 Median : 6.954 Median : 7.434 Median :93.97
## Mean :4.764e+09 Mean : 6.291 Mean : 6.999 Mean :91.30
## 3rd Qu.:6.822e+09 3rd Qu.: 7.488 3rd Qu.: 8.121 3rd Qu.:94.87
## Max. :8.792e+09 Max. :10.867 Max. :16.017 Max. :98.49
## sleep_records
## Min. : 1.00
## 1st Qu.: 4.75
## Median :20.50
## Mean :17.08
## 3rd Qu.:27.25
## Max. :31.00
Sleep Efficiency: 91.3% on average across users (excellent)
- Above the 85% threshold for good sleep quality
- Users spend most of their time in bed actually asleep
- A median of 94% means half of the users reach at least 94% efficiency
- Range: 63%-98%, so even the least efficient user is asleep for nearly two-thirds of the time in bed
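The record-level average (6.99 hours) and the mean of the per-user averages (6.29 hours) reported above differ because users with only a few logged nights count as much as heavy trackers once the data is grouped by `Id`. A minimal sketch of that effect, using made-up numbers in a toy data frame (only the column names follow the report):

```r
# Toy data: user 1 logs three normal nights, user 2 logs a single short nap
toy <- data.frame(
  Id = c(1, 1, 1, 2),
  TotalMinutesAsleep = c(420, 450, 480, 60)
)
toy$sleep_hours <- toy$TotalMinutesAsleep / 60

# Mean over all records: every night has equal weight
record_mean <- mean(toy$sleep_hours)

# Mean of per-user means: every user has equal weight
per_user <- tapply(toy$sleep_hours, toy$Id, mean)
user_mean <- mean(per_user)

c(record_mean = record_mean, user_mean = user_mean)
```

Which average is the right headline depends on the question: nights in general (record-level) or a typical user (user-level).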
Let’s visualize the distribution of sleep efficiency.
# Sleep efficiency histogram
Clean_DailySleep %>%
mutate(sleep_efficiency = (TotalMinutesAsleep / TotalTimeInBed) * 100) %>%
ggplot(aes(x = sleep_efficiency)) +
geom_histogram(binwidth = 2, fill = "#1982C4", color = "white") +
geom_vline(aes(xintercept = 85), color = "red", linetype = "dashed", linewidth = 1) +
annotate("text", x = 80, y = 40, label = "85% efficiency\nthreshold", color = "red") +
labs(title = "Sleep Efficiency Distribution",
subtitle = "Sleep efficiency = (Time Asleep / Time in Bed) × 100%",
x = "Sleep Efficiency (%)",
y = "Frequency") +
theme_minimal()
# Classify users by sleep duration
user_sleep_summary <- user_sleep_summary %>%
mutate(
sleep_category = case_when(
avg_sleep_hours < 6 ~ "Insufficient Sleep (<6h)",
avg_sleep_hours < 7 ~ "Below Recommended (6-7h)",
avg_sleep_hours <= 9 ~ "Recommended (7-9h)",
TRUE ~ "Excessive Sleep (>9h)"
)
)
# See distribution
table(user_sleep_summary$sleep_category)
##
## Below Recommended (6-7h) Excessive Sleep (>9h) Insufficient Sleep (<6h)
## 5 1 8
## Recommended (7-9h)
## 10
# With percentages
user_sleep_summary %>%
count(sleep_category) %>%
mutate(percentage = n / sum(n) * 100) %>%
arrange(desc(n))
## # A tibble: 4 × 3
## sleep_category n percentage
## <chr> <int> <dbl>
## 1 Recommended (7-9h) 10 41.7
## 2 Insufficient Sleep (<6h) 8 33.3
## 3 Below Recommended (6-7h) 5 20.8
## 4 Excessive Sleep (>9h) 1 4.17
Let’s create a visualization based on this distribution.
# Bar chart of sleep categories
user_sleep_summary %>%
count(sleep_category) %>%
mutate(
percentage = n / sum(n) * 100,
sleep_category = factor(sleep_category,
levels = c("Insufficient Sleep (<6h)",
"Below Recommended (6-7h)",
"Recommended (7-9h)",
"Excessive Sleep (>9h)"))
) %>%
ggplot(aes(x = sleep_category, y = n, fill = sleep_category)) +
geom_col() +
geom_text(aes(label = paste0(n, " users\n(", round(percentage, 1), "%)")),
vjust = -0.3, size = 4) +
scale_fill_manual(values = c("Insufficient Sleep (<6h)" = "#E63946",
"Below Recommended (6-7h)" = "#F4A261",
"Recommended (7-9h)" = "#2A9D8F",
"Excessive Sleep (>9h)" = "#264653")) +
labs(title = "Sleep Quality Classification",
subtitle = "Based on CDC recommendations (7-9 hours for adults)",
x = "",
y = "Number of Users") +
theme_minimal() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 15, hjust = 1)) +
ylim(0, max(table(user_sleep_summary$sleep_category)) + 3)
Interpretation:
- The largest group, with 10 users (41.7%), averages 7-9 hours of sleep. These individuals meet standard health guidelines for sleep duration.
- ‘Insufficient Sleep (<6h)’ is the second-largest group, with 8 users (33.3%). These users average well under the recommended minimum, although some of them, such as the user with a single 1.02-hour record, have too few nights logged to be conclusive.
- ‘Below Recommended (6-7h)’ contains 5 users (20.8%) who fall slightly under the target, often referred to as “short sleepers.”
- ‘Excessive Sleep (>9h)’ is a very small minority: only 1 user (4.2%) sleeps longer than the typical clinical recommendation.
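With only 24 users, each percentage above carries a lot of uncertainty. One way to make that visible is an exact binomial confidence interval from base R's `binom.test()`, shown here for the 10-of-24 count from the table. Treating these users as a random sample is an assumption that a survey-based dataset may not satisfy, so this is a rough sketch rather than a formal result.

```r
# Counts taken from the sleep category table above
n_recommended <- 10
n_users <- 24

share <- n_recommended / n_users

# Exact (Clopper-Pearson) 95% confidence interval for the share
ci <- binom.test(n_recommended, n_users)$conf.int

round(100 * c(estimate = share, lower = ci[1], upper = ci[2]), 1)
```

The interval is wide, so the ranking of the categories is more trustworthy than the exact percentages.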
# Histogram of sleep hours
Clean_DailySleep %>%
mutate(sleep_hours = TotalMinutesAsleep / 60) %>%
ggplot(aes(x = sleep_hours)) +
geom_histogram(binwidth = 0.5, fill = "#6A4C93", color = "white") +
geom_vline(aes(xintercept = 7), color = "green", linetype = "dashed", linewidth = 1) +
geom_vline(aes(xintercept = 9), color = "green", linetype = "dashed", linewidth = 1) +
annotate("text", x = 8, y = 50, label = "Recommended\n7-9 hours", color = "green") +
labs(title = "Distribution of Sleep Duration",
subtitle = "Green lines indicate CDC recommended range (7-9 hours)",
x = "Hours of Sleep",
y = "Frequency") +
theme_minimal()
The next analysis examines the relationship between sleep and overall activity. Because SleepDay in Clean_DailySleep is still stored as POSIXct, I first convert it to Date so that it can be matched against ActivityDate when merging with the Daily Activity dataframe.
# Convert SleepDay from POSIXct to Date
Clean_DailySleep <- Clean_DailySleep %>%
mutate(SleepDay = as.Date(SleepDay))
# Verify the conversion
class(Clean_DailySleep$SleepDay)
## [1] "Date"
head(Clean_DailySleep$SleepDay)
## [1] "2016-04-12" "2016-04-13" "2016-04-15" "2016-04-16" "2016-04-17"
## [6] "2016-04-19"
SleepDay is now stored as a Date and is ready to merge.
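One thing worth checking in a conversion like this: base R's `as.Date()` reads a POSIXct value in UTC by default, so a late-evening timestamp stored in another time zone can land on the following calendar day. The sketch below uses a made-up timestamp and time zone purely for illustration. If SleepDay was parsed as UTC midnight values, the conversion above is unaffected.

```r
# Hypothetical late-evening timestamp in a non-UTC time zone
x <- as.POSIXct("2016-04-12 23:30:00", tz = "America/New_York")

# Default: the date is taken in UTC, where it is already the next day
date_utc <- as.Date(x)

# Passing tz keeps the date in the timestamp's own zone
date_local <- as.Date(x, tz = "America/New_York")

c(utc = format(date_utc), local = format(date_local))
```

A shifted date would silently drop or mismatch rows in the join on `ActivityDate`, so comparing row counts before and after the merge is a cheap safeguard.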
# Merge Daily Activity and Daily Sleep dataframe
sleep_activity <- Clean_DailyActivity %>%
inner_join(Clean_DailySleep, by = c("Id" = "Id", "ActivityDate" = "SleepDay"))
# View it
head(sleep_activity)
## # A tibble: 6 × 22
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## <dbl> <date> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 13162 8.5 8.5
## 2 1503960366 2016-04-13 10735 6.97 6.97
## 3 1503960366 2016-04-15 9762 6.28 6.28
## 4 1503960366 2016-04-16 12669 8.16 8.16
## 5 1503960366 2016-04-17 9705 6.48 6.48
## 6 1503960366 2016-04-19 15506 9.88 9.88
## # ℹ 17 more variables: LoggedActivitiesDistance <dbl>,
## # VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## # LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## # VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>,
## # TotalActiveMinutes <dbl>, day_of_week <ord>, day_type <chr>,
## # TotalSleepRecords <dbl>, TotalMinutesAsleep <dbl>, TotalTimeInBed <dbl>, …
# Add sleep-related variables
sleep_activity <- sleep_activity %>%
mutate(
sleep_hours = TotalMinutesAsleep / 60,
sleep_efficiency = (TotalMinutesAsleep / TotalTimeInBed) * 100,
total_active_minutes = VeryActiveMinutes + FairlyActiveMinutes + LightlyActiveMinutes
)
# Summary
summary(sleep_activity %>% select(sleep_hours, TotalSteps, Calories, total_active_minutes))
## sleep_hours TotalSteps Calories total_active_minutes
## Min. : 0.9667 Min. : 17 Min. : 257 Min. : 2.0
## 1st Qu.: 6.0167 1st Qu.: 5189 1st Qu.:1841 1st Qu.:206.5
## Median : 7.2083 Median : 8913 Median :2207 Median :263.5
## Mean : 6.9862 Mean : 8515 Mean :2389 Mean :259.5
## 3rd Qu.: 8.1667 3rd Qu.:11370 3rd Qu.:2920 3rd Qu.:315.5
## Max. :13.2667 Max. :22770 Max. :4900 Max. :540.0
Next, I want to answer the following questions:
1. Is there a meaningful correlation between sleep hours and steps?
2. Is there a meaningful correlation between sleep hours and calories? Does sleeping more burn more calories?
3. Is there a meaningful correlation between sleep hours and active minutes?
4. Is there a meaningful correlation between sleep efficiency and steps?
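`cor()` returns only the coefficient. To judge whether a correlation is statistically significant, base R's `cor.test()` also reports a p-value and a confidence interval. A minimal sketch on simulated data (the vectors `x` and `y` are random numbers generated here, not the Fitbit data):

```r
set.seed(42)

# Simulated stand-ins for sleep hours (x) and steps (y), with a weak negative link
x <- rnorm(50)
y <- -0.2 * x + rnorm(50)

res <- cor.test(x, y)   # Pearson correlation test by default

res$estimate   # same value as cor(x, y)
res$p.value    # probability of a correlation this extreme if the true value were 0
res$conf.int   # 95% confidence interval for the correlation
```

On the real data the call would be `cor.test(sleep_activity$sleep_hours, sleep_activity$TotalSteps)`. Because the rows are repeated days from the same users, the resulting p-value should be read as approximate.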
# Key correlations
cor(sleep_activity$sleep_hours, sleep_activity$TotalSteps)
## [1] -0.1903439
cor(sleep_activity$sleep_hours, sleep_activity$Calories)
## [1] -0.03169899
cor(sleep_activity$sleep_hours, sleep_activity$total_active_minutes)
## [1] -0.06929398
cor(sleep_activity$sleep_efficiency, sleep_activity$TotalSteps)
## [1] -0.1100255
# Create a correlation summary
sleep_correlations <- data.frame(
Metric = c("Sleep Hours vs Steps",
"Sleep Hours vs Calories",
"Sleep Hours vs Active Minutes",
"Sleep Efficiency vs Steps"),
Correlation = c(
cor(sleep_activity$sleep_hours, sleep_activity$TotalSteps),
cor(sleep_activity$sleep_hours, sleep_activity$Calories),
cor(sleep_activity$sleep_hours, sleep_activity$total_active_minutes),
cor(sleep_activity$sleep_efficiency, sleep_activity$TotalSteps)
)
)
print(sleep_correlations)
## Metric Correlation
## 1 Sleep Hours vs Steps -0.19034392
## 2 Sleep Hours vs Calories -0.03169899
## 3 Sleep Hours vs Active Minutes -0.06929398
## 4 Sleep Efficiency vs Steps -0.11002554
Before interpreting these results, let’s visualize the first relationship.
# Visualization 1: Sleep vs Steps
ggplot(sleep_activity, aes(x = sleep_hours, y = TotalSteps)) +
geom_point(alpha = 0.5, color = "#6A4C93") +
geom_smooth(method = "lm", color = "#E63946", se = TRUE) +
geom_vline(xintercept = 7, linetype = "dashed", color = "green", alpha = 0.5) +
geom_vline(xintercept = 9, linetype = "dashed", color = "green", alpha = 0.5) +
annotate("text", x = 8, y = max(sleep_activity$TotalSteps) * 0.9,
label = "Recommended\n7-9 hours", color = "green", size = 3) +
labs(title = "Sleep Duration vs Daily Steps: No Positive Relationship",
subtitle = "Correlation: -0.19 (weak negative)",
x = "Hours of Sleep",
y = "Total Steps") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The Surprising Finding: No Positive Correlation
Contrary to conventional wisdom that “better sleep leads to more activity,” this analysis reveals no positive relationship between sleep duration and physical activity levels.
Correlation Analysis:
- Sleep Hours vs Steps: -0.19 (weak negative)
- Sleep Hours vs Calories: -0.03 (essentially zero)
- Sleep Hours vs Active Minutes: -0.07 (essentially zero)
- Sleep Efficiency vs Steps: -0.11 (weak negative)
The scatter plot shows a slight downward trend: days with more sleep tend to have slightly fewer steps, though the relationship is weak and highly variable.
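Because each user contributes many days, a pooled correlation mixes differences between users with day-to-day changes within a user, and the two can even point in opposite directions. The sketch below shows this with a deliberately extreme, made-up toy dataset (only the column names follow the report):

```r
# Toy data: within each user more sleep goes with more steps,
# but the user who sleeps more overall walks less overall
toy <- data.frame(
  Id = rep(c("A", "B"), each = 3),
  sleep_hours = c(6, 7, 8, 8, 9, 10),
  TotalSteps = c(10000, 11000, 12000, 4000, 5000, 6000)
)

# Pooled correlation across all user-days
pooled <- cor(toy$sleep_hours, toy$TotalSteps)

# Correlation computed separately inside each user
within <- sapply(split(toy, toy$Id),
                 function(d) cor(d$sleep_hours, d$TotalSteps))

pooled   # negative
within   # positive for both users
```

Running the same split on `sleep_activity` would show whether the weak negative pooled correlation also holds inside individual users.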
Next, let’s compare average daily steps across the sleep duration categories.
# Average steps by sleep category
sleep_activity <- sleep_activity %>%
mutate(
sleep_category = case_when(
sleep_hours < 6 ~ "Insufficient (<6h)",
sleep_hours < 7 ~ "Below Rec. (6-7h)",
sleep_hours <= 9 ~ "Recommended (7-9h)",
TRUE ~ "Excessive (>9h)"
),
sleep_category = factor(sleep_category,
levels = c("Insufficient (<6h)",
"Below Rec. (6-7h)",
"Recommended (7-9h)",
"Excessive (>9h)"))
)
# Bar chart
sleep_activity %>%
group_by(sleep_category) %>%
summarise(
avg_steps = mean(TotalSteps),
count = n()
) %>%
ggplot(aes(x = sleep_category, y = avg_steps, fill = sleep_category)) +
geom_col() +
geom_text(aes(label = paste0(round(avg_steps, 0), " steps\n(n=", count, ")")),
vjust = -0.3, size = 3.5) +
scale_fill_manual(values = c("Insufficient (<6h)" = "#E63946",
"Below Rec. (6-7h)" = "#F4A261",
"Recommended (7-9h)" = "#2A9D8F",
"Excessive (>9h)" = "#264653")) +
labs(title = "Average Daily Steps by Sleep Quality",
subtitle = "Well-rested users don't necessarily take more steps",
x = "",
y = "Average Steps") +
theme_minimal() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 15, hjust = 1)) +
ylim(0, max(tapply(sleep_activity$TotalSteps, sleep_activity$sleep_category, mean)) * 1.15)
This analysis looks at how consistently users engage with their fitness trackers, which reveals critical insights about user behavior and product stickiness.
# How many days did each user track activity?
user_tracking <- Clean_DailyActivity %>%
group_by(Id) %>%
summarise(
tracking_days = n(),
avg_steps = mean(TotalSteps)
) %>%
arrange(desc(tracking_days))
# Summary statistics
summary(user_tracking$tracking_days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 38.50 42.00 39.23 42.00 62.00
# View the data
head(user_tracking)
## # A tibble: 6 × 3
## Id tracking_days avg_steps
## <dbl> <int> <dbl>
## 1 4020332650 62 4115.
## 2 1503960366 49 12175.
## 3 1624580081 49 5137.
## 4 4445114986 45 4729.
## 5 4702921684 45 8553
## 6 6962181067 44 10789.
Let’s create a visualization based on this summary
# Viz1: Activity Tracking Distribution
ggplot(user_tracking, aes(x = tracking_days)) +
geom_histogram(binwidth = 5, fill = "#2E86AB", color = "white") +
geom_vline(aes(xintercept = median(tracking_days)),
color = "red", linetype = "dashed", linewidth = 1) +
annotate("text", x = median(user_tracking$tracking_days) + 5, y = 8,
label = paste("Median:", median(user_tracking$tracking_days), "days"),
color = "red") +
labs(title = "User Engagement: Days of Activity Tracking",
subtitle = "How consistently do users wear their devices?",
x = "Number of Days Tracked",
y = "Number of Users") +
theme_minimal()
Activity Tracking: Strong Engagement
Finding: Users demonstrate solid commitment to activity tracking, with a median of 42 days tracked over the study period.
Key Statistics:
- Median: 42 days
- Average: 39.2 days
- Range: 8-62 days
- Most users (20 out of 35) tracked for 38-42+ days
Interpretation: The concentration of users around the 40-day mark suggests good device adoption and habit formation. Users who start tracking tend to stick with it for at least a month, which suggests that the value proposition for activity tracking is clear to them, at least within this small sample of survey respondents who consented to share their data.
# How many days did each user track sleep?
sleep_tracking <- Clean_DailySleep %>%
group_by(Id) %>%
summarise(sleep_days = n()) %>%
arrange(desc(sleep_days))
# Summary statistics
summary(sleep_tracking$sleep_days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 4.75 20.50 17.08 27.25 31.00
# View it
head(sleep_tracking)
## # A tibble: 6 × 2
## Id sleep_days
## <dbl> <int>
## 1 5553957443 31
## 2 6962181067 31
## 3 8378563200 31
## 4 2026352035 28
## 5 3977333714 28
## 6 4445114986 28
# Merge tracking data
tracking_comparison <- user_tracking %>%
left_join(sleep_tracking, by = "Id") %>%
mutate(sleep_days = ifelse(is.na(sleep_days), 0, sleep_days))
# Summary
tracking_comparison %>%
summarise(
avg_activity_days = mean(tracking_days),
avg_sleep_days = mean(sleep_days, na.rm = TRUE),
users_tracking_sleep = sum(sleep_days > 0)
)
## # A tibble: 1 × 3
## avg_activity_days avg_sleep_days users_tracking_sleep
## <dbl> <dbl> <int>
## 1 39.2 11.7 24
Sleep Tracking: The Engagement Gap
Finding: Sleep tracking shows significantly lower engagement, with a median of only 20.5 days - less than half the activity tracking rate.
Key Statistics:
- Median: 20.5 days (among those who tracked)
- Average across all users: 11.7 days
- 31% of users (11 out of 35) never tracked sleep at all
- Sleep tracking is about 50% less consistent than activity tracking (median of 20.5 vs 42 days)
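The key statistics above can be cross-checked with a few lines of arithmetic. The inputs below are the figures printed in the outputs of this report (35 users in total, 24 of them with sleep records, a mean of 17.08 sleep days among trackers, and medians of 20.5 and 42 days).

```r
n_users <- 35                       # users with activity data
n_sleep_users <- 24                 # users with at least one sleep record
mean_sleep_days_trackers <- 17.08   # mean sleep days among the 24 trackers

# Spreading the trackers' days over all users (non-trackers count as 0)
mean_all <- mean_sleep_days_trackers * n_sleep_users / n_users

share_tracking <- n_sleep_users / n_users
share_never <- 1 - share_tracking

# Median sleep days relative to median activity days
median_ratio <- 20.5 / 42

round(c(mean_all = mean_all,
        pct_tracking = 100 * share_tracking,
        pct_never = 100 * share_never,
        median_ratio = median_ratio), 2)
```

This is why the average drops from 17.08 to 11.7 days once the 11 users with no sleep records are included as zeros.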
# Viz2: Activity vs. Sleep Tracking Comparison
tracking_long <- tracking_comparison %>%
select(Id, tracking_days, sleep_days) %>%
pivot_longer(cols = c(tracking_days, sleep_days),
names_to = "tracking_type",
values_to = "days") %>%
mutate(tracking_type = ifelse(tracking_type == "tracking_days",
"Activity Tracking",
"Sleep Tracking"))
# Box plot comparison
ggplot(tracking_long, aes(x = tracking_type, y = days, fill = tracking_type)) +
geom_boxplot() +
scale_fill_manual(values = c("Activity Tracking" = "#2E86AB",
"Sleep Tracking" = "#6A4C93")) +
labs(title = "Tracking Consistency: Activity vs Sleep",
subtitle = "Sleep tracking is significantly less consistent",
x = "",
y = "Days Tracked") +
theme_minimal() +
theme(legend.position = "none")
The Dramatic Difference: The box plot visualization clearly shows activity tracking clustered around 40 days, while sleep tracking has a much wider distribution with many users at zero or very low numbers.
# Does tracking consistency correlate with activity?
ggplot(user_tracking, aes(x = tracking_days, y = avg_steps)) +
geom_point(color = "#2E86AB", size = 3, alpha = 0.7) +
geom_smooth(method = "lm", color = "#E63946", se = TRUE) +
labs(title = "Does Consistent Tracking Lead to More Steps?",
subtitle = paste("Correlation:",
round(cor(user_tracking$tracking_days, user_tracking$avg_steps), 3)),
x = "Days Tracked",
y = "Average Daily Steps") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Calculate correlation
cor(user_tracking$tracking_days, user_tracking$avg_steps)
## [1] 0.330517
Engagement Drives Results
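Critical Finding: Tracking consistency correlates positively with activity levels (r = 0.33), a moderate association across 35 users.
Users who tracked for 40+ days averaged ~8,000 steps, while those tracking fewer than 20 days averaged ~4,000 steps.
A correlation alone cannot show which way the effect runs, but the pattern is consistent with the following: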
- Consistent tracking creates accountability
- Regular data viewing motivates behavior change
- Habit formation requires sustained engagement
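To gauge how firmly r = 0.33 is established with 35 users, a confidence interval can be approximated with the Fisher z-transformation. The sketch below uses the correlation printed above and the user count stated in this report. It assumes the users are independent and roughly bivariate normal, which may not hold here.

```r
r <- 0.330517   # correlation between tracking days and average steps
n <- 35         # number of users

# Fisher z-transformation and its approximate standard error
z <- atanh(r)
se <- 1 / sqrt(n - 3)

# 95% interval on the z scale, transformed back to the correlation scale
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(ci, 3)
```

With these inputs the lower bound lands very close to zero, so the positive association is suggestive rather than firmly established.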
Bellabeat Case Study: Key findings & strategic recommendations
Analysis of 35 Fitbit users’ activity, sleep, and usage patterns reveals critical opportunities for Bellabeat to differentiate itself in the wellness technology market. My findings challenge conventional wisdom about sleep-activity relationships while identifying clear user segments and engagement gaps.
The Sedentary Crisis (16.7 hours/day): Users are sedentary for over 16 hours daily but achieve only 34 minutes of moderate-to-vigorous activity. This represents the primary health risk and the biggest intervention opportunity.
Majority of Users Need Support (57%): 31% are sedentary and 26% are lightly active. Together they represent the core target audience, who need motivation and behavioral change support.
Sleep-Activity Paradox: More sleep is NOT associated with more activity (r = -0.19). Sleep and activity appear to be largely independent behaviors that call for separate interventions, not a simple “sleep more → move more” message.
Sleep Tracking Engagement Gap: Only 69% of users (24 out of 35) track sleep, and those who do track it about half as consistently as activity (median of 20.5 vs 42 days). This represents a major UX challenge.
Engagement Drives Results: Consistent tracking correlates with higher activity (r = 0.33). Users tracking 40+ days average ~8,000 steps vs ~4,000 steps for those tracking <20 days.
PRIORITY 1: Combat Sedentary Behavior
The Problem: 16.7 hours/day sedentary time is unsustainable and unhealthy.
Solutions:
PRIORITY 2: Segment-Specific Engagement
The Problem: 57% of users are sedentary or lightly active and need different support than the 20% who are very active.
So here’s the recommendation for every category:
PRIORITY 3: Fix Sleep Tracking UX
The Problem: 31% of users never track sleep; those who do track it on about half as many days as activity.
Solutions:
- Auto-Sleep Detection: Remove the need to manually activate sleep mode
- Charging Solutions: Rapid charging (80% in 30 minutes during the morning routine) or, alternatively, two devices (wear one while charging the other)
- Comfort First: Smaller, lighter sleep-specific device or redesigned band
- Demonstrate Value: Show “Sleep Score” and next-day energy predictions immediately
- Bedtime Reminders: “To get 8 hours, go to bed by 10:30 PM”
PRIORITY 4: Optimal Timing for Interventions
The Problem: Users have distinct activity patterns that current generic notifications ignore.
Time-Based Strategy:
PRIORITY 5: Emphasize Intensity Over Volume
The Problem: Users fixate on steps but tend to overlook intensity, even though intensity correlates with calories at least as strongly as step count does (r = 0.59 vs 0.58).
Solutions:
- Intensity Zones: Display time in heart rate zones prominently
- Quality Over Quantity: “20 minutes of vigorous activity > 20,000 light steps”
- 10-Minute HIIT Challenges: For time-constrained users
- Distance + Intensity Metrics: Track both steps and “active distance”
- Reframe Success: “You hit your intensity goal!” not just “You hit your steps!”
PRIORITY 6: Build Habit Formation Features
The Problem: Tracking consistency predicts success, but many users don’t build lasting habits.
Solutions:
PRIORITY 7: Separate Sleep & Activity Messaging
The Problem: No positive correlation between sleep and activity challenges the “sleep better → move more” narrative.
New Approach:
The path to better health isn’t about revolutionary changes - it’s about sustainable, personalized, and consistent small improvements. Bellabeat’s opportunity lies in meeting users where they are, understanding their unique challenges, and providing intelligent, compassionate support that respects the complexity of real human behavior.
The winning formula: Segment intelligently + Reduce friction + Time interventions + Celebrate progress + Build habits.
That’s it from me for the Capstone project in the Bellabeat Case Study. Thank you so much for your interest in the project!
Alifia Ganjaraharja